

Search for: All records

Creators/Authors contains: "Ferdman, Michael"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not be available free of charge during the embargo period.

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Luiz Barroso started his career at Digital Equipment Corporation, investigating workload-optimized multiprocessor server architectures marketed to enterprises in the 1990s. These high-margin, low-volume products lost their market to more cost-effective enterprise servers built from high-volume desktop CPUs riding Moore’s law. The enterprise market has slowly transitioned to the cloud, where desktop PCs have formed the backbone of computing in data centers since the early 2000s to minimize cost and maximize the return on investment. Moving forward, in the absence of Moore’s law, future servers require a clean-slate, cross-stack design to scale in compute, communication, and storage capacity while reducing operational, capital, and environmental costs.
  2. Yang, Chia-Lin (Ed.)
    Server applications exhibit a high degree of code repetition because they handle many similar requests. In turn, repeated execution of the same code, often with identical inputs, highlights an inefficiency in the execution of server software and suggests memoization as a way to improve performance. Memoization has been extensively explored in software, and several hardware and hardware-assisted memoization schemes have been proposed in the literature. However, these works targeted memoization of mathematical or algorithmic processing, whereas server applications call for a different approach. We observe that the opportunity for memoization in servers arises not from eliminating the repetition of complex computation, but from eliminating the repetition of software orchestration code. This work studies hardware memoization in servers, ultimately focusing on one pattern: instruction sequences starting with indirect jumps. We explore how an out-of-order pipeline can be extended to support memoization of these instruction sequences, demonstrating the potential of hardware memoization for servers. Using 26 applications to make our case (3 CloudSuite workloads and 23 vSwarm serverless functions), we show how targeting just this one pattern of instruction sequences can memoize over 10% (up to 15.6%) of the dynamically executed instructions in these server applications.
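    To make the idea above concrete, here is a minimal software analogue of the pattern, not the paper's pipeline mechanism: it caches the result of a handler reached through an indirect call, keyed by the call target and its input, so a repeated (target, input) pair skips re-execution. The names Handler and dispatch_memoized are illustrative, and the scheme assumes the memoized code has no side effects.

        #include <cstdint>
        #include <map>
        #include <string>
        #include <utility>

        // A request handler reached through an indirect call (e.g. via a jump table).
        using Handler = std::string (*)(std::uint64_t);

        // Reuse a previously computed result when the same (target, input) pair
        // repeats; otherwise execute the handler and record its output.
        std::string dispatch_memoized(Handler target, std::uint64_t input) {
            static std::map<std::pair<std::uintptr_t, std::uint64_t>, std::string> memo;
            const auto key = std::make_pair(reinterpret_cast<std::uintptr_t>(target), input);
            auto it = memo.find(key);
            if (it != memo.end())
                return it->second;               // repetition: skip re-execution
            std::string result = target(input);  // first occurrence: run and record
            memo.emplace(key, result);
            return result;
        }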
  3. Modern last-level caches are partitioned into slices that are spread across the chip, giving rise to varying access latencies dictated by the physical location of the accessing core and the cache slice being accessed. Although prior work has shown that dynamically determining the best location for blocks within such Non-Uniform Cache Access (NUCA) architectures can provide significant performance benefits, current hardware does not implement this functionality. Instead, modern processors hash blocks across the LLC slices, obscuring the non-uniform architecture of the underlying cache and forfeiting the performance benefits of placing data in the nearest cache slices. Moreover, while prior work advocated improving performance by delegating control over block placement to the operating system at page granularity, modern processor hardware thwarts these approaches by hashing cache slice selection at cache block granularity. In this work, we make two observations that enable us to improve software performance on modern NUCA architectures. First, we find that software can undo the hashing performed by hardware and efficiently manage data placement at cache block granularity. Second, we find that the complexity of fine-grained data placement can be hidden from the developer by embedding it in the dynamic memory allocator. Leveraging these observations, we design a new specialized memory allocator, NUCAlloc, suitable for use with C++ containers such as std::map and std::set. NUCAlloc handles the complexity of NUCA-aware block placement, improving the performance of containers by placing their data in the nearest LLC slices. We demonstrate that our NUCAlloc prototype consistently outperforms std::allocator and jemalloc for LLC-resident containers, improving performance by up to 20% in both single-threaded and multi-threaded software.
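    The allocator-level idea can be sketched as follows; this is not the NUCAlloc implementation, and slice_of, nearest_slice_to_this_core, and the retry bound are placeholders standing in for the reverse-engineered slice hash and topology lookup a real system would need. The sketch hands out cache-block-aligned chunks and keeps only those whose address maps to the desired slice.

        #include <cstddef>
        #include <cstdint>
        #include <cstdlib>
        #include <new>
        #include <vector>

        constexpr int kNumSlices = 8;
        int slice_of(const void* p) {                    // placeholder slice hash
            return static_cast<int>((reinterpret_cast<std::uintptr_t>(p) >> 6) % kNumSlices);
        }
        int nearest_slice_to_this_core() { return 0; }   // placeholder topology lookup

        // Allocator skeleton: draw cache-block-sized chunks until one lands in the
        // slice nearest the calling core; set wrong-slice chunks aside for reuse.
        template <class T>
        struct nuca_allocator {
            using value_type = T;
            nuca_allocator() = default;
            template <class U> nuca_allocator(const nuca_allocator<U>&) {}

            T* allocate(std::size_t n) {
                const int want = nearest_slice_to_this_core();
                const std::size_t bytes = ((n * sizeof(T) + 63) / 64) * 64;
                for (int attempt = 0; attempt < 64; ++attempt) {
                    void* p = std::aligned_alloc(64, bytes);
                    if (!p) throw std::bad_alloc();
                    if (slice_of(p) == want) return static_cast<T*>(p);
                    set_aside_.push_back(p);             // wrong slice: keep for later reuse
                }
                void* fallback = set_aside_.back();      // bounded effort: accept a miss
                set_aside_.pop_back();
                return static_cast<T*>(fallback);
            }
            void deallocate(T* p, std::size_t) { std::free(p); }

            static std::vector<void*> set_aside_;        // a real allocator would recycle these
        };
        template <class T> std::vector<void*> nuca_allocator<T>::set_aside_;

        template <class T, class U>
        bool operator==(const nuca_allocator<T>&, const nuca_allocator<U>&) { return true; }
        template <class T, class U>
        bool operator!=(const nuca_allocator<T>&, const nuca_allocator<U>&) { return false; }

        // Example use with a standard container, mirroring the target use case:
        // std::map<int, int, std::less<int>, nuca_allocator<std::pair<const int, int>>> m;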
  4. State machine replication (SMR) is a core mechanism for building highly available and consistent systems. In this paper, we propose Waverunner, a new approach to accelerate SMR using FPGA-based SmartNICs. Our approach does not implement the entire SMR system in hardware; instead, it is a hybrid software/hardware system. We make the observation that, despite the complexity of SMR, the most common routine, data replication, is actually simple. The complex parts (leader election, failure recovery, etc.) are rarely used in modern datacenters, where failures are only occasional. These complex routines are not performance critical; their software implementations are fast enough and do not need acceleration. Therefore, our system uses FPGA assistance to accelerate data replication and leaves the rest to the traditional software implementation of SMR. Our Waverunner approach is beneficial in both common-case and rare-case situations. In the common case, the system runs at the speed of the network, with a 99th percentile latency of 1.8 µs achieved without batching on minimum-size packets at network line rate (85.5 Gbps in our evaluation). In rare cases, to handle uncommon situations such as leader failure and failure recovery, the system uses traditional software to guarantee correctness, which is much easier to develop and maintain than a hardware-based implementation. Overall, our experience confirms Waverunner as an effective and practical solution for hardware-accelerated SMR, achieving most of the benefits of hardware acceleration with minimal added complexity and implementation effort.
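    The common-case/rare-case split can be pictured with the control-flow sketch below; SmartNicReplicator, SoftwareSmr, and HybridSmr are hypothetical names with stub bodies, standing in for the FPGA data path and a conventional software SMR library rather than Waverunner's actual interfaces.

        #include <string_view>

        // Stand-in for the FPGA-backed fast path; returns false when it cannot
        // handle a request (e.g. the leader changed).
        struct SmartNicReplicator {
            bool append_and_replicate(std::string_view) { return true; }  // stub
        };

        // Stand-in for a conventional software SMR implementation that owns the
        // complex, rarely exercised logic (leader election, failure recovery, ...).
        struct SoftwareSmr {
            void recover() {}                    // stub
            void append(std::string_view) {}     // stub
        };

        class HybridSmr {
        public:
            void replicate(std::string_view entry) {
                // Common case: replicate at network speed through the NIC.
                if (fast_.append_and_replicate(entry)) return;
                // Rare case: fall back to software, which is slower but guarantees
                // correctness and is far simpler to maintain than hardware logic.
                slow_.recover();
                slow_.append(entry);
            }
        private:
            SmartNicReplicator fast_;
            SoftwareSmr slow_;
        };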
  5. Motivation: In the past few years, researchers have proposed numerous indexing schemes for searching large datasets of raw sequencing experiments. Most of these proposed indexes are approximate (i.e. with one-sided errors) in order to save space. Recently, researchers have published exact indexes (Mantis, VariMerge and Bifrost) that can serve as colored de Bruijn graph representations in addition to serving as k-mer indexes. This new type of index is promising because it has the potential to support more complex analyses than simple searches. However, in order to be useful as indexes for large and growing repositories of raw sequencing data, they must scale to thousands of experiments and support efficient insertion of new data.
    Results: In this paper, we show how to build a scalable and updatable exact raw sequence-search index. Specifically, we extend Mantis using the Bentley–Saxe transformation to support efficient updates; we call the result Dynamic Mantis. We demonstrate Dynamic Mantis’s scalability by constructing an index of ≈40K samples from SRA, adding samples one at a time to an initial index of 10K samples. Compared to VariMerge and Bifrost, Dynamic Mantis is more efficient in terms of index-construction time and memory, query time and memory, and index size. In our benchmarks, VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Dynamic Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost (VariMerge does not immediately support the general search queries we require). Dynamic Mantis indexes were about 2.5× smaller than Bifrost’s indexes and about half as big as VariMerge’s indexes.
    Availability and implementation: The Dynamic Mantis implementation is available at https://github.com/splatlab/mantis/tree/mergeMSTs.
    Supplementary information: Supplementary data are available at Bioinformatics online.
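    The Bentley–Saxe transformation mentioned above can be sketched generically; the toy below dynamizes a sorted-vector "static index" rather than Mantis itself, so the class and method names are illustrative. Level i holds either nothing or roughly 2^i keys; an insertion cascades merges upward until it finds an empty level, and a query probes every non-empty level.

        #include <algorithm>
        #include <cstddef>
        #include <iterator>
        #include <vector>

        class BentleySaxeIndex {
        public:
            // Insert one key: treat it as a singleton batch and merge it upward.
            void insert(int key) {
                std::vector<int> carry{key};
                std::size_t i = 0;
                while (i < levels_.size() && !levels_[i].empty()) {
                    // Merge the carried batch with the occupied level and continue.
                    std::vector<int> merged;
                    std::merge(carry.begin(), carry.end(),
                               levels_[i].begin(), levels_[i].end(),
                               std::back_inserter(merged));
                    levels_[i].clear();
                    carry = std::move(merged);
                    ++i;
                }
                if (i == levels_.size()) levels_.emplace_back();
                levels_[i] = std::move(carry);   // first empty level absorbs the batch
            }

            // A query probes every non-empty level; each level stays sorted.
            bool contains(int key) const {
                for (const auto& level : levels_)
                    if (std::binary_search(level.begin(), level.end(), key))
                        return true;
                return false;
            }

        private:
            std::vector<std::vector<int>> levels_;  // level i holds 0 or 2^i keys
        };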